Abstract

In the typical analysis of a data set, a single method is selected for statistical
reporting even when equally applicable methods yield very different results. Examples
of equally applicable methods can correspond to those of different ancillary statistics
in frequentist inference and of different prior distributions in Bayesian inference. More
broadly, choices are made between parametric and nonparametric methods and
between frequentist and Bayesian methods. Rather than choosing a single method, it can
be safer, in a game-theoretic sense, to combine those that are equally appropriate in
light of the available information. Since methods of combining subjectively assessed
probability distributions are not objective enough for that purpose, this paper
introduces a method of distribution combination that does not require any assignment of
distribution weights. It does so by formalizing a hedging strategy in terms of a game
between three players: nature, a statistician combining distributions, and a statistician
refusing to combine distributions. The optimal move of the first statistician reduces to
the solution of a simpler problem of selecting an estimating distribution that minimizes
the Kullback-Leibler loss maximized over the plausible distributions to be combined.
The resulting combined distribution is a linear combination of the most extreme of the
distributions to be combined that are scientifically plausible. The optimal weights are
close enough to each other that no extreme distribution dominates the others. The new
methodology is illustrated by combining conflicting empirical Bayes methodologies in
the context of gene expression data analysis.

Keywords: ancillarity; conditional inference; combining probabilities; combining
probability distributions; combining tests in parallel; confidence distribution;
confidence posterior; cross entropy; game theory; imprecise probability; inferential gain;
Kullback-Leibler information; Kullback-Leibler divergence; large-scale simultaneous
inference; linear opinion pool; minimax redundancy; multiple hypothesis testing; multiple
comparison procedure; observed confidence level; redundancy-capacity theorem
1 Introduction

The analysis of biological data often requires choices between methods that seem equally
applicable and yet that can yield very different results. This occurs not only with the
notorious problems in frequentist statistics of conditioning on one of multiple ancillary
statistics and in Bayesian statistics of selecting one of many appropriate priors, but also
in choices between frequentist and Bayesian methods, in whether to use a potentially
powerful parametric test to analyze a small sample of unknown distribution, in whether and
how to adjust for multiple testing, and in whether to use a frequentist model averaging
procedure. Today, statisticians simultaneously testing thousands of hypotheses must often
decide whether to apply a multiple comparisons procedure using the assumption that the
p-value is uniform under the null hypothesis (theoretical null distribution) or a null
distribution estimated from the data (empirical null distribution). While the empirical
null reduces estimation bias in many situations (Efron, 2007), it also increases variance
(Efron, 2010) and substantially increases bias when the data distributions have heavy
tails (Bickel, 2011d). Without any strong indication of which method can be expected to
perform better for a particular data set, combining their estimated false discovery rates
or adjusted p-values may be the safest approach.



Emphasizing the reference class problem, Barndorff-Nielsen (1995) pointed out the need
for ways to assess the evidence in the diversity of statistical inferences that can be
drawn from the same data. Previous applications of p-value combination have included
combining inferences from different ancillary statistics (Good, 1984), combining
inferences from more robust procedures with those from procedures with stronger
assumptions, and combining inferences from different alternative distributions (Good,
1958). However, those combination procedures are only justified by a heuristic Bayesian
argument and have not been widely adopted. To offer a viable alternative, the problem of
combining conflicting methods is framed herein in terms of probability combination.

Most existing methods of automatically combining probability distributions have been
designed for the integration of expert opinions. For example, Toda (1956), Abbas (2009),
and Kracik (2011) proposed combining distributions to minimize a weighted sum of
Kullback-Leibler divergences from the distributions being combined, with the weights
determined subjectively, e.g., by the elicitation of the opinions of the experts who
provided the distributions or by the extent to which each expert is considered credible.
Under broad conditions, that approach leads to the linear combination of the distributions
that is defined by those weights (Toda, 1956; Kracik, 2011).

"Linear opinion pools" also result from this marginalization property: any linearly
combined marginal distribution is the same whether marginalization or combination is
carried out first (McConway, 1981). The marginalization property forbids certain
counterintuitive combinations of distributions, including any combination of distributions
that differs in a probability assignment from the unanimous assignment of all distributions
combined (Cooke, 1991, p. 173). Combinations violating the marginalization property can be
expected to perform poorly as estimators regardless of their appeal as distributions of
belief. On the other hand, invariance to reversing the order of Bayesian updating and
distribution combination instead requires a "logarithmic opinion pool," which uses a
geometric mean in place of the arithmetic mean of the linear opinion pool; see, e.g.,
Berger (1985, §4.11.1) or Clemen and Winkler (1999). While that property is preferable to
the marginalization property from the point of view of a Bayesian agent making decisions on
the basis of independent reports of other Bayesian agents, it is less suitable for
combining distributions that are highly dependent or that are distribution estimates rather
than actual distributions of belief. Genest and Zidek (1986) and Cooke (1991, Ch. 11)
review these and related issues.



Like those methods, the strategy introduced in this paper is intended for combining 
distributions based on the same data or information as opposed to combining distributions 
based on independent data sets. However, to relax the requirement that the distributions 
be provided by experts, the weights are optimized rather than specified. While the new 
strategy leads to a linear combination of distributions, the combination hedges by including 
only the most extreme distributions rather than all of the distributions. In addition, the game 
leading to the hedging takes into account any known constraints on the true distribution. 
See Remark 1 on the pivotal role game theory played in the foundations of statistics.

The game that generates the hedging strategy is played between three players: the
mechanism that generates the true distribution ("Nature"), a statistician who never
combines distributions ("Chooser"), and a statistician open to combining distributions
("Combiner"). Nature must select a distribution that complies with constraints known to the
statisticians, who want to choose distributions as close as possible to the distribution
chosen by Nature. Other things being equal, each statistician would also like to select a
distribution that is as much better than that of the other statistician as possible. Thus,
each statistician seeks primarily to come close to the truth and secondarily to improve
upon the distribution selected by the other statistician. Combiner has the advantage over
Chooser that the former may select any distribution, whereas the latter must select one
from a given set of the distributions that estimate the true distribution or that encode
expert opinion. On the other hand, Combiner is disadvantaged in that Nature seeks to
maximize the gain of Chooser without concern for the gain of Combiner. Since Nature favors
Chooser without opposing Combiner, the optimal strategy of Combiner is one of hedging but
is less cautious than the minimax strategies that are often optimal for typical two-player
zero-sum games against Nature. The distribution chosen according to the strategy of
Combiner will be considered the combination of the distributions available to Chooser. The
combination distribution is a function not only of the combining distributions but also of
the constraints on the true distribution.

Sec. 2 encodes the game and strategy described above in terms of Kullback-Leibler
loss and presents its optimal solution as a general method of combining distributions. The
important special case of combining probabilities is then worked out. A framework for
using the proposed combination method to resolve method conflicts in point and interval
estimation, hypothesis testing, and other aspects of statistical data analysis will be
presented in Sec. 3. The framework is illustrated by applying it to the combination of
three false discovery rate methods for the analysis of microarray data in Sec. 4. Finally,
Appendices A and B collect miscellaneous remarks and proofs, respectively.



2 Framework for combining distributions 
2.1 Information-theoretic background 

Let $\mathcal{P}$ denote the set of probability distributions on a Borel space
$(\Xi, \mathcal{B}(\Xi))$, where $\mathcal{B}(\Xi)$ is the set of all Borel subsets of
$\Xi$. The information divergence of $P \in \mathcal{P}$ with respect to
$Q \in \mathcal{P}$ is defined as

$$D(P\|Q) = \int dP(\xi) \log \frac{dP(\xi)}{dQ(\xi)}, \qquad (1)$$

where $dP$ and $dQ$ are probability density functions of $P$ and $Q$ in the sense of
Radon-Nikodym differentiation with respect to the same dominating measure (Haussler, 1997).
The integrand follows the $0 \log(0) = 0$ and $0 \log(0/0) = 0$ conventions. $D(P\|Q)$ is
also known as "information for discrimination," "Kullback-Leibler information,"
"Kullback-Leibler divergence," and "cross entropy." Calling $D(P\|Q)$ "information
divergence" emphasizes its interpretation as the amount of information that would be
gained by replacing any distribution $Q$ with the true distribution $P$. That
interpretation accords with calling

$$D(P'\|P'' \rightsquigarrow Q) = D(P'\|P'') - D(P'\|Q) \qquad (2)$$

the information gain (Pfaffelhuber, 1977), the amount of information gained by using $Q$
rather than $P'' \in \mathcal{P}$ when the true distribution is $P' \in \mathcal{P}$ (cf.
Topsøe, 2007).
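For discrete distributions, eqs. (1) and (2) reduce to finite sums that are easy to check
numerically. The following is a minimal sketch rather than part of the original
development; the function names `kl_divergence` and `information_gain` and the example
distributions are hypothetical choices made only for illustration.

```python
import math


def kl_divergence(p, q):
    """Information divergence D(P||Q) of eq. (1) for discrete distributions.

    p and q are equal-length sequences of probabilities. Natural logarithms
    are used, terms with p_i = 0 contribute nothing, and the result is
    infinite when P puts mass where Q has none.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


def information_gain(p_true, p_old, q_new):
    """Information gain of eq. (2): D(P'||P'') - D(P'||Q)."""
    return kl_divergence(p_true, p_old) - kl_divergence(p_true, q_new)


# Illustrative (made-up) distributions on {0, 1}.
p_true = [0.2, 0.8]   # P'
p_old = [0.5, 0.5]    # P''
q_new = [0.25, 0.75]  # Q
gain = information_gain(p_true, p_old, q_new)
```

A positive `gain` indicates that $Q$ is closer than $P''$ to the true distribution in the
sense of information divergence.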



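For any real parameter set $\Phi$ and family $\mathcal{P}^* = \{P_\phi : \phi \in \Phi\}$
of probability distributions such that $\mathcal{P}^* \subseteq \mathcal{P}$, the
distribution

$$\overline{\mathcal{P}^*} = \arg\inf_{Q \in \mathcal{P}} \sup_{P^* \in \mathcal{P}^*} D(P^*\|Q) \qquad (3)$$

is called the centroid of $\mathcal{P}^*$ (Csiszár and Körner, 2011, p. 131). Let
$\mathcal{W}$ denote the set of all measures on the Borel space
$(\Phi, \mathcal{B}(\Phi))$. Reserving the term prior for Sec. 3, members of $\mathcal{W}$
will be called weighting distributions. Then
$P_W = E_W \mathcal{P}^* = \int P_\phi \, dW(\phi)$ defines the mixture distribution of
$\mathcal{P}^*$ with respect to some $W \in \mathcal{W}$, and

$$W_{\mathcal{P}^*} = \arg\sup_{W \in \mathcal{W}} \int D(P_\phi \| E_W \mathcal{P}^*) \, dW(\phi) \qquad (4)$$

defines the weighting distribution induced by $\mathcal{P}^*$.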



Example 1. In the case of a family of $\nu$ distributions, the parameter set can be written
as $\Phi = \{\phi_1, \dots, \phi_\nu\}$ and the weighting distribution as

$$\left(W_{\mathcal{P}^*}(\phi_1), \dots, W_{\mathcal{P}^*}(\phi_\nu)\right) = \arg\sup_{w_1 \in [0,1], \dots, w_\nu \in [0,1]} \sum_{i=1}^{\nu} w_i D\left(P_{\phi_i} \Big\| \sum_{j=1}^{\nu} w_j P_{\phi_j}\right),$$

where the supremum is that of the set of weight $\nu$-tuples constrained by
$\sum_{i=1}^{\nu} w_i = 1$. Shulman and Feder (2004) proved that, for all
$i = 1, \dots, \nu$,

$$W_{\mathcal{P}^*}(\phi_i) \le 1 - e^{-1} \approx 63\%. \qquad (5)$$
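The weighting distribution of Example 1 can be approximated numerically for a finite
family of discrete distributions. The sketch below assumes the Blahut-Arimoto iteration, a
standard algorithm for this kind of maximization that is not mentioned in the text above;
the function names and the three example distributions are hypothetical.

```python
import math


def kl(p, q):
    """D(P||Q) for discrete distributions, skipping terms with zero mass in P."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def mixture(weights, dists):
    """Mixture distribution sum_i w_i P_i, computed outcome by outcome."""
    return [sum(w * d[k] for w, d in zip(weights, dists))
            for k in range(len(dists[0]))]


def induced_weights(dists, iterations=2000):
    """Approximate the weights maximizing sum_i w_i D(P_i || mixture).

    Blahut-Arimoto update: w_i <- w_i * exp(D(P_i || mixture)), renormalized.
    """
    nu = len(dists)
    weights = [1.0 / nu] * nu
    for _ in range(iterations):
        mix = mixture(weights, dists)
        scores = [w * math.exp(kl(d, mix)) for w, d in zip(weights, dists)]
        total = sum(scores)
        weights = [s / total for s in scores]
    return weights


# Three made-up distributions on {0, 1}; the middle one is not extreme.
dists = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
weights = induced_weights(dists)
```

In this symmetric illustration the weight of the middle distribution shrinks toward zero
and the two extreme distributions share the remaining weight, in agreement with the bound
of eq. (5).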



The next known result will prove useful in determining the optimal move in the game of
combining distributions that was mentioned in Sec. 1.

Lemma 1. The centroid of any nonempty $\mathcal{P}^* \subseteq \mathcal{P}$ is
$\overline{\mathcal{P}^*} = P_{W_{\mathcal{P}^*}}$, where $W_{\mathcal{P}^*}$ is the
weighting distribution induced by $\mathcal{P}^*$.

Proof. Two different proofs appear in Haussler (1997) and in Grünwald and Dawid (2004).
For some history of this result, see Remark 2. □
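Lemma 1 can be checked by brute force in the simplest setting of distributions on
$\{0, 1\}$. The sketch below is an illustration under assumptions, not the method of the
cited proofs: it locates the centroid by a grid search over candidate values of
$Q(\{0\})$, and the probabilities in `probs` are made up.

```python
import math


def kl_bernoulli(p, q):
    """D(P||Q) for distributions on {0, 1} with P({0}) = p and Q({0}) = q."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def centroid_by_grid(probs, grid_size=20001):
    """Minimize the largest divergence max_i D(P_i||Q) over a grid of Q({0})."""
    best_q, best_worst = None, math.inf
    for k in range(1, grid_size):
        q = k / grid_size
        worst = max(kl_bernoulli(p, q) for p in probs)
        if worst < best_worst:
            best_q, best_worst = q, worst
    return best_q, best_worst


probs = [0.05, 0.2, 0.6]  # hypothetical values of P_i({0})
q_star, radius = centroid_by_grid(probs)
divergences = [kl_bernoulli(p, q_star) for p in probs]

# Writing the centroid as a mixture of the two extremes gives the implied
# weight on the lowest probability.
weight_on_lowest = (max(probs) - q_star) / (max(probs) - min(probs))
```

At the grid minimizer, the two extreme distributions are approximately equidistant from
the centroid while the intermediate one is closer, so the centroid is a mixture of the
extremes only.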



2.2 Distribution-combination game 

The game sketched in Sec. 1 will now be specified in the above notation. Two sets constrain
moves in the game: the plausible set $\dot{\mathcal{P}}$ is the subset of $\mathcal{P}$
consisting of given plausible distributions, and $\ddot{\mathcal{P}} \subseteq \mathcal{P}$
consists of given combining distributions. The move of Nature is a distribution
$\dot{P} \in \dot{\mathcal{P}}$; the move of Chooser is a distribution
$\ddot{P} \in \ddot{\mathcal{P}}$; the move of Combiner is a distribution
$P^+ \in \mathcal{P}$. Chooser and Combiner are called statisticians. If $P_1$ is the move
of one statistician and $P_2$ is that of the other, then the amount of utility paid to the
latter is the pair

$$U(\dot{P}, P_1, P_2) = \left(-D(\dot{P}\|P_2), \; D(\dot{P}\|P_1 \rightsquigarrow P_2)\right), \qquad (6)$$

understood in terms of preferring $v = (v_1, v_2) \in [0, \infty) \times [0, \infty)$ over
$u = (u_1, u_2) \in [0, \infty) \times [0, \infty)$ if and only if $u \preceq v$. Here,
$u \preceq v$ means that either $u_1 < v_1$ or both $u_1 = v_1$ and $u_2 \le v_2$. Such
preferences are said to have lexicographic ordering (Remark 3).
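The lexicographic preference just described coincides with the way Python compares tuples,
which makes it easy to illustrate. In this sketch, the function name `weakly_prefers` and
the numerical pairs are hypothetical.

```python
def weakly_prefers(v, u):
    """Return True when the pair v is weakly preferred to the pair u.

    Python compares tuples lexicographically, so u <= v holds exactly when
    u[0] < v[0], or u[0] == v[0] and u[1] <= v[1].
    """
    return tuple(u) <= tuple(v)


u = (0.2, 0.9)
v = (0.3, 0.1)

# The first component decides whenever it differs, whatever the second is.
first_component_decides = weakly_prefers(v, u) and not weakly_prefers(u, v)

# The second component matters only when the first components are tied.
tie_broken_by_second = weakly_prefers((0.3, 0.5), (0.3, 0.1))
```

With floating-point utilities, exact ties in the first component are rare, so an
implementation intended for data analysis would probably need a tolerance for equality;
that refinement is omitted here.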

Thus, the utility paid to Combiner will be $U(\dot{P}, \ddot{P}, P^+)$ and that paid to
Chooser will be $U(\dot{P}, P^+, \ddot{P})$. The utility paid to Nature will also be
$U(\dot{P}, P^+, \ddot{P})$, with the implication that it is to the advantage of Nature and
Chooser to act as a coalition with move $(\dot{P}, \ddot{P})$ (von Neumann and Morgenstern,
1953, Ch. 5). Although that reduces the three-player game to a two-player game of the
coalition versus Combiner, it is not necessarily of zero sum.

The combination of the distributions in $\ddot{\mathcal{P}}$ with truth constrained by
$\dot{\mathcal{P}}$ is defined as Combiner's optimal move in the game. Since the utility
paid to the Nature-Chooser coalition is $U(\dot{P}, P^+, \ddot{P})$, Combiner's best move
may be written as

$$P^+ = \arg\sup_{Q \in \mathcal{P} \,:\, (\dot{P}_Q, \ddot{P}_Q) \in \dot{\mathcal{P}} \times \ddot{\mathcal{P}}} U(\dot{P}_Q, \ddot{P}_Q, Q), \qquad (7)$$

where $\sup$ is the least upper bound according to $\preceq$, and, for all $Q$,

$$(\dot{P}_Q, \ddot{P}_Q) = \arg\sup_{(P', P'') \in \dot{\mathcal{P}} \times \ddot{\mathcal{P}}} U(P', Q, P''). \qquad (8)$$

While $P^+$ is not necessarily a plausible distribution, it is typically at the center of
the plausible set:

Theorem 1. Let $P^+$ denote the combination of the distributions in $\ddot{\mathcal{P}}$
with truth constrained by $\dot{\mathcal{P}}$. If
$\dot{\mathcal{P}} \cap \ddot{\mathcal{P}} \ne \emptyset$, then

$$P^+ = \overline{\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}} = P_{W_{\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}}}, \qquad (9)$$

where $\overline{\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}}$ is the centroid of
$\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}$, and
$W_{\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}}$ is the weighting distribution induced by
$\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}$, as defined by eq. (4).

Let $\mathcal{A}$ denote an action space. A decision made by taking the action
$a \in \mathcal{A}$ that minimizes the expectation value of a loss function
$L : \Xi \times \mathcal{A} \to \mathbb{R}$ with respect to $P^+$ is optimal in the game
when the utility function of eq. (6) is replaced with

$$\left(-D(\dot{P}\|P_2), \; D(\dot{P}\|P_1 \rightsquigarrow P_2), \; -\int L(\xi, a) \, dP_2(\xi)\right).$$

The latter utility function is understood in terms of the lexicographic ordering relation
$\preceq$, which is defined such that $(u_1, u_2, u_3) \preceq (v_1, v_2, v_3)$ if and only
if one of the following is true: (i) $u_1 < v_1$; (ii) $u_1 = v_1$ and $u_2 < v_2$; (iii)
$u_1 = v_1$, $u_2 = v_2$, and $u_3 \le v_3$. On related orderings in the literature, see
Remark 3.

2.3 Combining discrete distributions 

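Now let $\mathcal{P}$ denote the set of probability distributions on $(\Xi, 2^\Xi)$, where
$\Xi$ is a finite set written as $\Xi = \{0, 1, \dots, |\Xi| - 1\}$ without loss of
generality. Then the information divergence of $P$ with respect to $Q$ reduces to

$$D(P\|Q) = \sum_{i=0}^{|\Xi| - 1} P(\{i\}) \log \frac{P(\{i\})}{Q(\{i\})}.$$

For any $P \in \mathcal{P}$ and $\xi \sim P$, the $|\Xi|$-tuple
$T(P) = (P(\xi = 0), P(\xi = 1), \dots, P(\xi = |\Xi| - 1))$ will be called the tuple
representing $P$.

Let $\mathcal{P}^*$ denote a nonempty subset of $\mathcal{P}$, and let $T(\mathcal{P}^*)$
denote the set of tuples representing the members of $\mathcal{P}^*$, i.e.,
$T(\mathcal{P}^*) = \{T(P^*) : P^* \in \mathcal{P}^*\}$. Likewise, noting that the map $T$
is one-to-one, the extreme subset of $\mathcal{P}^*$ is defined as
$\operatorname{ext} \mathcal{P}^* = T^{-1}(\operatorname{ext} \operatorname{co} T(\mathcal{P}^*))$,
where $\operatorname{ext} \operatorname{co} T(\mathcal{P}^*)$ is the set of extreme points
of $\operatorname{co} T(\mathcal{P}^*)$, the convex hull of $T(\mathcal{P}^*)$. The extreme
subset simplifies the problem of locating a centroid:

Lemma 2. Let $\mathcal{P}^*$ denote a nonempty, finite subset of $\mathcal{P}$. If there
are a $Q \in \mathcal{P}$ and a $C > 0$ such that $D(P^*\|Q) = C$ for all
$P^* \in \operatorname{ext} \mathcal{P}^*$, then $Q$ is the centroid of $\mathcal{P}^*$.

More simplification is possible if at least one of the combining distributions is plausible:

Theorem 2. Let $P^+$ denote the combination of the distributions in $\ddot{\mathcal{P}}$
with truth constrained by $\dot{\mathcal{P}}$. If
$\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}$ is nonempty and finite, then
$P^+ = P_{W_{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}}$, where
$W_{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}$ is the weighting
distribution induced (eq. 4) by
$\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})$, the extreme subset of
$\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}$.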



The combination of a set of probabilities of the same hypothesis or event is simply a
linear combination or mixture of the highest and lowest of the plausible probabilities in
the set, with the mixing proportion determined optimally:

Corollary 1. Let $P^+$ denote the combination of the distributions in $\ddot{\mathcal{P}}$
with truth constrained by $\dot{\mathcal{P}}$. Suppose $c$ distributions on
$(\{0, 1\}, 2^{\{0,1\}})$ are to be combined
($\ddot{\mathcal{P}} = \{P_1, \dots, P_c\}$), and let
$\dot{\mathcal{P}}_0 = \{\dot{P}(\{0\}) : \dot{P} \in \dot{\mathcal{P}}\}$ and
$\underline{P}, \overline{P} \in \ddot{\mathcal{P}}$ such that

$$\underline{P}(\{0\}) = \min_{i = 1, \dots, c \,:\, P_i(\{0\}) \in \dot{\mathcal{P}}_0} P_i(\{0\}); \qquad \overline{P}(\{0\}) = \max_{i = 1, \dots, c \,:\, P_i(\{0\}) \in \dot{\mathcal{P}}_0} P_i(\{0\}).$$

If $P_i(\{0\}) \in \dot{\mathcal{P}}_0$ for some $i \in \{1, \dots, c\}$, then
$P^+ = w^+ \underline{P} + (1 - w^+) \overline{P}$, where

$$w^+ = \arg\sup_{w \in [0,1]} \left(w \Delta(\underline{P}\|w) + (1 - w) \Delta(\overline{P}\|w)\right); \qquad (10)$$

$$\Delta(\bullet\|w) = D\left(\bullet \,\|\, w \underline{P} + (1 - w) \overline{P}\right).$$

Proof. This follows immediately from Theorem 2 and the definition of an extreme subset. □

By eq. (5), $37\% \le w^+ \le 63\%$, implying that $P^+(\{0\})$ is close to the arithmetic
mean $(\underline{P}(\{0\}) + \overline{P}(\{0\}))/2$, as Shulman and Feder (2004) observed
in a coding context. Fig. 1 plots $w^+$ versus $\underline{P}(\{0\})$ and
$\overline{P}(\{0\})$, and Fig. 2 compares the resulting $P^+(\{0\})$ to the
arithmetic mean, the geometric mean, and the harmonic mean of $\underline{P}(\{0\})$ and
$\overline{P}(\{0\})$.
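The optimal weight of eq. (10) has no closed form in general, but it is a one-dimensional
maximization of a concave function and so is simple to approximate. The sketch below
assumes a ternary search, which is only one of several suitable numerical methods; the
function names and the example probabilities 0.01 and 0.2 are hypothetical.

```python
import math


def kl_bernoulli(p, q):
    """D(P||Q) for distributions on {0, 1} with P({0}) = p and Q({0}) = q."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def optimal_weight(p_low, p_high, tol=1e-10):
    """Approximate w+ of eq. (10), the weight placed on the lowest probability."""
    if p_low == p_high:
        return 0.5  # any weight yields the same combination

    def objective(w):
        mix = w * p_low + (1.0 - w) * p_high
        return (w * kl_bernoulli(p_low, mix)
                + (1.0 - w) * kl_bernoulli(p_high, mix))

    # The objective is concave in w, so ternary search finds its maximizer.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


p_low, p_high = 0.01, 0.2
w_plus = optimal_weight(p_low, p_high)
combined = w_plus * p_low + (1.0 - w_plus) * p_high

# Other means of the same two probabilities, for comparison.
means = {
    "arithmetic": (p_low + p_high) / 2.0,
    "geometric": math.sqrt(p_low * p_high),
    "harmonic": 2.0 / (1.0 / p_low + 1.0 / p_high),
}
```

Comparing `combined` with the entries of `means` gives a single-point version of the
comparison that the text attributes to Fig. 2.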

The next result is important for multiple hypothesis testing and, more generally, for 
combining probabilities of independent events rather than entire distributions. 

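Corollary 2. Let $\xi = (\xi_1, \dots, \xi_N)$, where $\xi_j$ is a Bernoulli random
variable and $\xi_j$ is independent of $\xi_J$ for all $j, J = 1, \dots, N$ with
$j \ne J$. (The Bernoulli distributions need not be identical: in general, each $\xi_j$ has
a different probability $P(\xi_j = 0)$ of failure.) Every $P \in \mathcal{P}$ has a
one-to-one correspondence to a tuple $\langle P(\xi_1 = 0), \dots, P(\xi_N = 0) \rangle$.
Assuming $\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}$ is nonempty and finite, let $P_i$
denote the $i$th of the $\nu$ members of
$\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}) = \{P_1, \dots, P_\nu\}$. If
the constraints are in the form of lower and upper probabilities
$\underline{P}_{0,1}, \dots, \underline{P}_{0,N}$ and
$\overline{P}_{0,1}, \dots, \overline{P}_{0,N}$ such that

$$\dot{\mathcal{P}} = \left\{P \in \mathcal{P} : \underline{P}_{0,j} \le P(\xi_j = 0) \le \overline{P}_{0,j}, \; j \in \{1, \dots, N\}\right\}, \qquad (11)$$

then the set of combining distributions that satisfy the constraints is

$$\dot{\mathcal{P}} \cap \ddot{\mathcal{P}} = \left\{P \in \ddot{\mathcal{P}} : \underline{P}_{0,j} \le P(\xi_j = 0) \le \overline{P}_{0,j}, \; j \in \{1, \dots, N\}\right\}. \qquad (12)$$

Further, $P^+ = P_{W_{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}}$ is
the combination of the distributions in $\ddot{\mathcal{P}}$ with truth constrained by
$\dot{\mathcal{P}}$, where

$$W_{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})} = \arg\sup \sum_{i=1}^{\nu} w_i \sum_{j=1}^{N} \sum_{k=0}^{1} P_i(\xi_j = k) \log \frac{P_i(\xi_j = k)}{P_{(w_1, \dots, w_\nu)}(\xi_j = k)} \qquad (13)$$

with the supremum over
$\mathcal{W} = \{(w_1', \dots, w_\nu') \in [0, 1]^\nu : \sum_{i=1}^{\nu} w_i' = 1\}$.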



Proof. Eq. (12) is obvious from eq. (11). By the independence condition, the chain rule for
information divergence (see, e.g., Cover and Thomas, 2006, Theorem 2.5.3) reduces finding
the weighting distribution for
$\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})$ according to Theorem 2 to
finding

$$\arg\sup \sum_{i=1}^{\nu} w_i \sum_{j=1}^{N} D\left(P_i(\xi_j = \bullet) \,\|\, P_{(w_1, \dots, w_\nu)}(\xi_j = \bullet)\right),$$

which, with eq. (1), yields eq. (13). □
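The maximization in eq. (13) can be explored numerically when only a few extreme
distributions are involved. The sketch below assumes $\nu = 3$ and an exhaustive search
over a grid on the weight simplex, which is a crude stand-in for a proper optimizer; the
function names and the three tuples of null-hypothesis probabilities are hypothetical.

```python
import math


def objective(weights, dists):
    """Sum in eq. (13) for given weights.

    dists[i][j] is P_i(xi_j = 0) and weights[i] is w_i; the mixture
    probability of xi_j = 0 is sum_i w_i * dists[i][j].
    """
    total = 0.0
    for j in range(len(dists[0])):
        mix0 = sum(w * d[j] for w, d in zip(weights, dists))
        for w, d in zip(weights, dists):
            for p, m in ((d[j], mix0), (1.0 - d[j], 1.0 - mix0)):
                if w > 0.0 and p > 0.0:
                    total += w * p * math.log(p / m)
    return total


def best_weights_on_grid(dists, steps=100):
    """Exhaustive search over a grid on the simplex of three weights."""
    best_w, best_val = None, -math.inf
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            w = (a / steps, b / steps, (steps - a - b) / steps)
            val = objective(w, dists)
            if val > best_val:
                best_w, best_val = w, val
    return best_w, best_val


# Made-up probabilities P_i(xi_j = 0) for N = 4 hypotheses and nu = 3
# distributions assumed to form the extreme subset.
dists = [
    [0.90, 0.50, 0.20, 0.05],
    [0.80, 0.60, 0.10, 0.02],
    [0.95, 0.30, 0.30, 0.10],
]
weights, value = best_weights_on_grid(dists)
combined = [sum(w * d[j] for w, d in zip(weights, dists))
            for j in range(len(dists[0]))]
```

Each entry of `combined` is a weighted average of the corresponding probabilities and
therefore lies between the smallest and largest of them.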



3 Distribution combination for statistical inference 



Whereas much of the literature focuses on combining priors from experts, Sec. 3.1 instead
focuses on combining posteriors. The posterior-inference setting enables the combination not
only of Bayesian posterior distributions but also of confidence intervals and p-values
encoded as frequentist posterior distributions, as will be explained in Sec. 3.2.



3.1 Combining posterior distributions and probabilities 

In the context of posterior statistical inference, $\xi$ represents a random parameter.
Further, all distributions in $\dot{\mathcal{P}}$, the set of plausible distributions of
$\xi$, and $\ddot{\mathcal{P}}$, the set of combining distributions of $\xi$, are posterior
with respect to the same data set $x$. All distributions in $\dot{\mathcal{P}}$ and all
Bayesian posteriors in $\ddot{\mathcal{P}}$ are conditional on $X = x$, where, for any such
posterior, the distribution of $X$ depends on the random value of the parameter drawn from
some prior. $\ddot{\mathcal{P}}$ may also contain non-Bayesian posteriors such as a
confidence posterior or a distribution derived from a confidence posterior (§3.2).

Accordingly, the information divergence $D(P\|Q)$ becomes the amount of information
that would be gained by replacing any posterior $Q$ with the true posterior $P$. That
interpretation leads to viewing $D(P'\|P'' \rightsquigarrow Q)$ as the amount of
information gained for statistical inference by using some posterior $Q \in \mathcal{P}$
rather than a given posterior $P'' \in \mathcal{P}$ when the plausible posterior is
$P' \in \dot{\mathcal{P}}$. Thus, $D(P'\|P'' \rightsquigarrow Q)$ defines the inferential
gain of $Q$ relative to $P''$ given $P'$ (Bickel, 2011a,c).



The posterior distributions are combined according to Sec. 2.2, using Theorem 1 whenever
possible. If $\dot{\mathcal{P}}$ represents the uncertainty around a Bayesian posterior
$P \in \dot{\mathcal{P}}$, as in Gajdos et al. (2004) and Bickel (2011c), then $P$ is
included in $\ddot{\mathcal{P}}$ as one of the distributions to combine. The resulting
combination $P^+$ is then used to minimize expected loss in order to optimize actions such
as point, interval, and function estimators and predictors.

In model selection and hypothesis testing, $\xi$ has 0 or 1 as its realized value, with
$\xi = 0$ if a reduced model or null hypothesis is true or $\xi = 1$ if a full model or
alternative hypothesis is true. Corollary 1 applies to this problem with
$\dot{\mathcal{P}}_0$ as the set of feasible null hypothesis posterior probabilities and
$\ddot{\mathcal{P}}_0 = \{P_1(\{0\}), \dots, P_c(\{0\})\}$ as the set of null hypothesis
posterior probabilities to be combined, where $P_i(\{0\}) = P_i(\xi = 0)$ is the $i$th
posterior probability that the null hypothesis is true. Thus, the combination posterior
probability that the null hypothesis is true is

$$P^+(\xi = 0) = w^+ \underline{P}(\xi = 0) + (1 - w^+) \overline{P}(\xi = 0). \qquad (14)$$

Here, $\underline{P}(\xi = 0) = \underline{P}(\{0\})$ and
$\overline{P}(\xi = 0) = \overline{P}(\{0\})$ are respectively the lowest and highest null
hypothesis posterior probabilities that are in
$\dot{\mathcal{P}}_0 \cap \ddot{\mathcal{P}}_0$, presently assumed to have at least one
member, and $w^+$ is determined by eq. (10). The same idea applies to multiple hypothesis
testing, as will be seen in Sec. 4.
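Eq. (14) suggests a short recipe: discard the implausible probabilities, keep the lowest
and highest of the rest, and mix them with the weight of eq. (10). The following sketch
assumes, purely for illustration, that the plausible set is an interval with hypothetical
end points `lower` and `upper`; the function names and numbers are likewise hypothetical,
and the weight is approximated by ternary search.

```python
import math


def kl_bernoulli(p, q):
    """D(P||Q) for distributions on {0, 1} with P({0}) = p and Q({0}) = q."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def optimal_weight(p_low, p_high, tol=1e-10):
    """Approximate the weight of eq. (10) by ternary search on a concave objective."""
    def objective(w):
        mix = w * p_low + (1.0 - w) * p_high
        return (w * kl_bernoulli(p_low, mix)
                + (1.0 - w) * kl_bernoulli(p_high, mix))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def combine_null_probabilities(probs, lower, upper):
    """Combination of eq. (14) for an interval-valued plausible set."""
    plausible = [p for p in probs if lower <= p <= upper]
    if not plausible:
        raise ValueError("no combining probability is plausible")
    p_low, p_high = min(plausible), max(plausible)
    if p_low == p_high:
        return p_low
    w = optimal_weight(p_low, p_high)
    return w * p_low + (1.0 - w) * p_high


# Four made-up posterior probabilities of the same null hypothesis; only
# 0.03 and 0.2 fall inside the assumed plausible interval [0.01, 0.5].
result = combine_null_probabilities([0.001, 0.03, 0.2, 0.6], 0.01, 0.5)
```

Probabilities strictly between the two plausible extremes do not affect the result, which
reflects the hedging character of the combination.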



3.2 Combining frequentist posteriors 



3.2.1 Confidence posteriors 



As mentioned in Sec. 3.2, the set $\ddot{\mathcal{P}}$ of posterior combining distributions
can include those representing confidence intervals and p-values. To emphasize their
comparability to Bayesian posterior distributions, these frequentist distributions are
called "confidence posterior distributions" (Bickel, 2011b), also known as "confidence
distributions" (see, e.g., Schweder and Hjort, 2002).



Briefly, a confidence posterior distribution that corresponds to a set of nested confidence
intervals evaluated for the observed data is defined as the probability distribution
according to which the posterior probability that the interest parameter lies within a
confidence interval is equal to the confidence level of the interval. For example, if a 95%
confidence interval for a real parameter is [−2.2, 1.7], then there is a 95% posterior
probability that the parameter is between −2.2 and 1.7 according to the confidence
posterior. The same confidence posterior for the data also assigns posterior probability to
parameter intervals of interest according to the confidence levels of the matching
confidence intervals, e.g., Bickel (2011d) considered a one-sided p-value as the posterior
probability that the population mean is in (−∞, 0) rather than [0, ∞). Efron and
Tibshirani (1998) and Polansky (2007) considered exact confidence posterior probabilities
of intervals or other regions specified before observing the data as ideal cases of
"attained confidence levels" and "observed confidence levels," respectively.
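The defining property of a confidence posterior can be made concrete in the simplest
textbook setting. The sketch below assumes a normal mean with a known standard error, for
which the confidence posterior is itself normal; that setting, the function names, and the
numbers (chosen so that the 95% interval is close to the one in the example above) are
assumptions made for illustration.

```python
from statistics import NormalDist


def confidence_posterior(estimate, standard_error):
    """Confidence posterior of a normal mean with known standard error."""
    return NormalDist(mu=estimate, sigma=standard_error)


def interval_probability(posterior, lower, upper):
    """Posterior probability that the parameter lies in [lower, upper]."""
    return posterior.cdf(upper) - posterior.cdf(lower)


estimate, standard_error = -0.25, 1.0
posterior = confidence_posterior(estimate, standard_error)

# Two-sided 95% confidence interval built from the same estimate.
z = NormalDist().inv_cdf(0.975)
ci_95 = (estimate - z * standard_error, estimate + z * standard_error)
prob_in_ci = interval_probability(posterior, *ci_95)

# Upper-tailed p-value for the null hypothesis that the mean is negative,
# and the posterior probability assigned to the interval below zero.
p_one_sided = 1.0 - NormalDist().cdf(estimate / standard_error)
prob_negative = posterior.cdf(0.0)
```

In this setting `prob_in_ci` matches the confidence level and `prob_negative` matches the
one-sided p-value, which is the sense in which confidence levels and p-values are read as
posterior probabilities.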
Bickel (2011b) proposed taking actions that minimize expected loss with respect to
a confidence posterior distribution. Since that distribution is a Kolmogorov probability
distribution of the parameter of interest, such actions comply with most axiomatic systems
usually considered Bayesian, e.g., the systems of von Neumann and Morgenstern (1953) and
Savage (1954). A human or artificial intelligent agent that bets and makes other decisions
by minimizing expected loss with respect to a confidence posterior thereby equates the
confidence level of a confidence interval with its level of belief that the parameter value
lies in the interval (Bickel, 2011b).

The decision-theoretic framework makes confidence posteriors suitable as members of
$\ddot{\mathcal{P}}$, the set of combining distributions, according to the methodology of
Sec. 2. They can be combined not only with each other, but also with other parameter
distributions such as Bayesian posteriors based on proper or improper priors. The same
applies to a probability distribution of a function of a parameter drawn from a confidence
posterior. Such posteriors have been used to equate posterior probabilities of simple null
hypotheses with two-sided p-values (van Berkum et al., 1996; Bickel, 2011e,a,c). For
terminological economy, these posteriors will now be called "confidence posteriors."



To approximate Bayesian model averaging. Good (1958) recommended a weighted har- 
monic mean of p-values computed from the same data, provided that they range from 10"^ 
to 0.2, the limits used in Fig. 2. Since the one-sided or two-sided p-values are posterior
probabilities of the null hypothesis derived from different confidence posteriors, eqs.
(10) and (14) can be applied with $\underline{P}(\xi = 0)$ and $\overline{P}(\xi = 0)$ as
the lowest and highest p-values that are plausible as null hypothesis probabilities, i.e.,
that are in $\dot{\mathcal{P}}_0$. The resulting combination p-value differs from that of
Good (1958) in two respects: the mean is arithmetic (14) rather than harmonic and, even
more important, the weights are optimal for the game (10) rather than subjective. The use
of optimal weights leads to preparing for the worst case by averaging only the two most
extreme p-values rather than all of them.

Example 2. Given a small sample of data drawn from a distribution that might be
approximately normal, let $p^{(1)}$ and $p^{(2)}$ denote the two-sided p-values according
to the t-test and the Wilcoxon signed-rank test, respectively; $p^{(1)} < p^{(2)}$. Under
conditions often applicable to simple (point) hypothesis testing with a diffuse alternative
hypothesis (Sellke et al., 2001),



the plausible set of posterior probabilities of the null hypothesis is Vq 
bound 



with lower 



Pn 



1 



-prior 



Po™V') (X) log [l/p(2) (x)] 



pprior 



• prior 

where $\wedge$ is the minimum operator, and $\underline{P}_0^{\text{prior}}$ is the lowest
plausible prior probability that the null hypothesis is true (Bickel, 2011c). Then the
combined p-value $P^+(\{0\})$ is $\underline{P}_0$ if $p^{(2)} < \underline{P}_0$,
$p^{(2)}$ if $p^{(1)} < \underline{P}_0 \le p^{(2)}$, and, according to Corollary 1, the
weighted arithmetic mean $w^+ p^{(1)} + (1 - w^+) p^{(2)}$ if
$p^{(1)} \ge \underline{P}_0$, with the weights $w^+$ and $(1 - w^+)$ fixed by eq. (10). Of
the three cases, the third yields a combined p-value that differs from the blended
posterior probability suggested in Bickel (2011a). ▲



When $\ddot{\mathcal{P}}$ consists of a single confidence posterior $\ddot{P}$, the
resulting $P^+$, degenerate as a "combination" of a single distribution, is better viewed
as a solution to the problem of blending frequentist inference with constraints encoded as
the Bayesian posteriors that constitute $\dot{\mathcal{P}}$. That solution in general
differs from the minimax-type solutions considered (Bickel, 2011a,c). Under
$\ddot{P} \in \dot{\mathcal{P}}$ and the convexity of $\dot{\mathcal{P}}$, they lead to the
$\dot{P}$ that minimizes the information divergence $D(\dot{P}\|\ddot{P})$, which is dual
to the $Q$ that minimizes $D(\dot{P}\|Q)$, the information divergence that is minimized
(15) when maximizing eq. (6) according to the game introduced in Sec. 2.2.



3.2.2 Multiple comparison procedures 

The distribution-combination theory is now applied to adjustments for multiple comparisons
by formalizing the observation that p-values are often adjusted to the extent of prior
belief in the null hypothesis. That is, multiple comparison procedures (MCPs) designed to
control error rates are "most likely to be used, if at all, when most of the individual
null hypotheses are essentially correct" (Cox, 2006, p. 88). A first-order formalization
would take the p-value adjusted according to an MCP as the posterior probability of the
null hypothesis. To the extent that the knowledge or opinion of the agent is such that its
decisions would be made to minimize the expected loss with respect to that posterior
distribution, the use of the MCP is warranted. In this interpretation, combining p-values
across different MCPs is equivalent to combining the posterior distributions that represent
the corresponding opinions.

Example 3. The Bonferroni procedure controls the family-wise error rate, the probability
that one or more true null hypotheses will be rejected, at any level $\alpha \in [0, 1]$.
That is accomplished on the basis of p-values $p_1, \dots, p_N$ by rejecting the $i$th of
$N$ null hypotheses if the adjusted p-value $N p_i \wedge 1$ is less than $\alpha$. Thus,
the posterior probabilities generated by the Bonferroni procedure are appropriate only when
the prior probability of each null hypothesis is inversely proportional to the number of
tests. As Westfall et al. (1997) pointed out, the "Bonferroni method is based upon the
implicit presumption of a moderate degree of belief in the event" that all null hypotheses
under consideration are true and that the prior truth values of the hypotheses are
approximately independent.

Accordingly, the Bonferroni method is widely used to analyze genome-wide association
data, largely because only an extremely small fraction $\pi_1$ of the hundreds of thousands
of markers tested are thought to be associated with the trait of interest. Wellcome Trust Case
Control Consortium (2007) guessed 10^^ < tti < 10^^, interpreting vri as the prior probability 



of association between a given marker and the trait. The corresponding range of posterior
probabilities and Bayes factors such as those of Wellcome Trust Case Control Consortium
(2007) would define $\dot{\mathcal{P}}$ for ruling out MCPs that yield implausible results
(Theorem 2). (On
the other hand, some evidence that tti > 10~^ is now available in preliminary estimates (Yang 



and Bickel, 2010) and in indications that thousands of small-effect SNPs may be associated
with any particular disease (Gibson, 2010; Park et al., 2010).) ▲
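The Bonferroni adjustment described in Example 3 takes only a few lines of code. In this
sketch the function names and the four p-values are hypothetical; the adjusted p-value is
the product of the number of tests and the raw p-value, capped at 1.

```python
def bonferroni_adjust(p_values):
    """Adjusted p-values: N * p_i, capped at 1 by the minimum operator."""
    n_tests = len(p_values)
    return [min(n_tests * p, 1.0) for p in p_values]


def bonferroni_reject(p_values, alpha):
    """Reject the ith null hypothesis when its adjusted p-value is below alpha."""
    return [adjusted < alpha for adjusted in bonferroni_adjust(p_values)]


p_values = [0.001, 0.02, 0.3, 0.5]  # made-up p-values for N = 4 tests
adjusted = bonferroni_adjust(p_values)
rejected = bonferroni_reject(p_values, alpha=0.05)
```

Under the interpretation of this section, each entry of `adjusted` would be read as a
posterior probability of the corresponding null hypothesis.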



By assuming adjusted p-values are equal to independent posterior probabilities of the
null hypotheses, the methodology of Corollary 2 can combine the results of various MCPs.



4 Large-scale case study 



Using microarray technology, Alba et al. (2005) measured the levels of tomato gene
expression for 13,440 genes at three days after the breaker stage of ripening, but one or more
measurements were missing for 7337 genes. The data available across all n = 6 biological
replicates for N = 6103 of the genes illustrate the methodology of Secs. 2 and 3.

For j = 1, ..., N, the logarithms of the measured ratios of mutant expression to wild-type
expression in the jth gene were modeled as realizations of a normal variate and are denoted
by the n-tuple $x_j$. Because the mean and variance are unknown, the one-sample t-test was
used to test the null hypothesis ($\zeta_j = 0$) that the population mean is 0 against the two-sided
alternative hypothesis ($\zeta_j = 1$) that there is differential expression of the jth gene between
mutant and wild type, i.e., that the mutation affects the expression of gene j.
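The one-sample t statistic for a single gene can be sketched as follows, assuming the n log ratios are held in a plain list. The function name `one_sample_t` and the six example values are hypothetical; the p-value step is omitted because it needs a Student t distribution function that the Python standard library does not provide.

```python
import math
import statistics


def one_sample_t(x, mu0=0.0):
    """One-sample Student t statistic for H0: population mean equals mu0."""
    n = len(x)
    mean = statistics.mean(x)
    sd = statistics.stdev(x)  # sample standard deviation (n - 1 denominator)
    return (mean - mu0) / (sd / math.sqrt(n))


# Hypothetical log expression ratios for one gene across n = 6 replicates.
log_ratios = [0.8, 1.1, 0.5, 0.9, 1.3, 0.7]
t_stat = one_sample_t(log_ratios)
print(t_stat)
```

The statistic has n - 1 degrees of freedom under the null hypothesis, which matches the degrees of freedom used for the noncentral t density later in this section.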

The posterior probability of a null hypothesis conditional on the p-value is called its local
false discovery rate (LFDR) (Efron et al., 2001). Three very different methods (i = 1, 2, 3) of
estimating the LFDR were considered. The first two methods are based on fitting a histogram
of transformed p-values that is described by Efron (2007). They differ in that whereas one
assumes the p-value has a uniform distribution under the null hypothesis (i = 1), the other
estimates the p-value null distribution by maximizing a truncated likelihood function (i = 2).
Each method has its own advantages. The distributions are called the theoretical null
and the empirical null, respectively. The third method for combination is the q-value (Storey,
2002), here defined according to the algorithm of Benjamini and Hochberg (1995) as the
lowest false discovery rate at which a null hypothesis will be rejected (i = 3). While the
q-value was not originally intended as an estimator of the LFDR, it is included here since
its negative bias as such an estimator (Hong et al., 2009) may have a corrective effect on the
positive bias (conservatism) of the first two LFDR estimators.
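The q-value of the third method, read as the lowest false discovery rate at which a hypothesis is rejected by the Benjamini and Hochberg (1995) step-up procedure, can be sketched as below. This is an illustrative implementation under that reading, not the code used for the case study; the function name `bh_q_values` and the example p-values are hypothetical.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg q-values: for each p-value, the smallest FDR level
    at which the step-up procedure rejects the corresponding hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values are monotone in rank.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        q[k] = running_min
    return q


q = bh_q_values([0.01, 0.04, 0.03, 0.5])
print(q)  # approximately [0.04, 0.0533, 0.0533, 0.5]
```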

For this application, $\mathcal{P}$ is the set of all probability distributions on $\left(\{0,1\}^N, 2^{\{0,1\}^N}\right)$.
Corresponding to those three methods, let $P_1$, $P_2$, and $P_3$ denote the members of $\mathcal{P}$ such that
the ith estimate of the LFDR of the jth gene is $P_i(\zeta_j = 0)$. To combine the three methods,
$\ddot{\mathcal{P}} = \{P_1, P_2, P_3\}$ is taken as the set of the three combining distributions.

For the jth gene, $f(t(x_j); \theta_j)$ will represent the probability density of the Student t
statistic $t(x_j)$, where $\theta_j$ is the reciprocal of the coefficient of variation and, for any $\theta \in \mathbb{R}$,
$f(\cdot; \theta)$ is the probability density function of $|T|$ when $T$ has the noncentral t distribution
of $n - 1$ degrees of freedom and noncentrality parameter $\sqrt{n}\,\theta$. The set $\dot{\mathcal{P}}$ of plausible
distributions will be determined on the basis of $\{L_j(\cdot) = f(t(x_j); \cdot) : j = 1, \dots, N\}$, the set
of likelihood functions. The plausible distributions are also based on $\pi_0 = 80\%$, an assumed
lower bound on the proportion of genes that are not differentially expressed. By Bayes's
theorem, the posterior odds of the jth null hypothesis is the product of the prior odds, which
is at least $\pi_0 / (1 - \pi_0)$, and the Bayes factor, which must be at least $L_j(0) / \max_{\theta \neq 0} L_j(\theta)$.
Thus, for gene j, a lower bound $O_j$ of the posterior odds is the product of the last two
quantities, and a lower bound of the LFDR is $\underline{P}(\zeta_j = 0) = O_j / (1 + O_j)$. In the notation
of Corollary 2, $\underline{P}_{0j} = \underline{P}(\zeta_j = 0)$ and, trivially, $\overline{P}_{0j} = 1$ for all $j = 1, \dots, N$. Thus, the
plausible set specified by eq. (11) consists of the posterior distributions satisfying the lower
bound derived from the likelihood functions and $\pi_0 = 80\%$.
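The arithmetic of that lower bound can be sketched as follows. The likelihood-ratio bound on the Bayes factor is taken as an input because evaluating the noncentral t density is outside the Python standard library; the function name `lfdr_lower_bound` and the example value 0.05 are hypothetical.

```python
def lfdr_lower_bound(bayes_factor_bound, pi0=0.80):
    """Lower bound on the LFDR from a lower bound on the Bayes factor
    (null versus alternative) and a lower bound pi0 on the null proportion."""
    prior_odds = pi0 / (1.0 - pi0)                   # at least 4 when pi0 = 0.80
    posterior_odds = prior_odds * bayes_factor_bound
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical gene whose likelihood ratio L_j(0) / max L_j(theta) equals 0.05:
# posterior odds are at least 4 * 0.05 = 0.2, so the LFDR is at least 0.2 / 1.2.
bound = lfdr_lower_bound(0.05)
print(round(bound, 4))  # 0.1667
```

Any estimated LFDR that falls below this value for some gene would violate the bound, which is the criterion used to exclude a method from the plausible set.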

The horizontal axis and straight line in Fig. 3 represent $\underline{P}$, and the intermediate,
highest, and lowest dashed curves represent $P_1$, $P_2$, and $P_3$, respectively. Since some of
the q-values are less than the lower bound ($\exists j : P_3(\zeta_j = 0) < \underline{P}(\zeta_j = 0)$) but all of the
other LFDR estimates satisfy the bound ($i = 1, 2$; $\forall j : P_i(\zeta_j = 0) \ge \underline{P}(\zeta_j = 0)$), the
former are excluded when computing the combined estimates according to eq. (12), in which
$\dot{\mathcal{P}} \cap \ddot{\mathcal{P}} = \{P_1, P_2\}$. Since there are only two distributions, each corresponds to an
extreme point, leading to $\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}) = \{P_1, P_2\}$ in eq. (13). Thereby, the combined
distribution is numerically found to be the linear combination $P^+ = w_1 P_1 + w_2 P_2$ with
$w_1 = 0.43$ and $w_2 = 0.57$. By implication, the game-optimal LFDR estimate for the jth
gene is $P^+(\zeta_j = 0) = w_1 P_1(\zeta_j = 0) + w_2 P_2(\zeta_j = 0)$. Those combined estimates are plotted
as the solid curve in Fig. 3.
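The exclusion and weighting steps just described can be sketched as below. The weights 0.43 and 0.57 are the values reported above and are supplied as inputs rather than computed; the LFDR estimates, the lower bounds, and the function names are hypothetical illustrations.

```python
def plausible_methods(estimates, lower_bound):
    """Indices of the methods whose estimate meets the lower bound for every gene."""
    return [i for i, est in enumerate(estimates)
            if all(p >= lb for p, lb in zip(est, lower_bound))]


def combine(estimates, weights):
    """Gene-wise linear combination of the retained LFDR estimates."""
    n_genes = len(estimates[0])
    return [sum(w * est[j] for w, est in zip(weights, estimates))
            for j in range(n_genes)]


# Hypothetical LFDR estimates for three genes from the three methods.
p1 = [0.60, 0.30, 0.90]
p2 = [0.70, 0.45, 0.95]
p3 = [0.20, 0.05, 0.50]   # q-values; some fall below the bound
lower = [0.25, 0.10, 0.40]

estimates = [p1, p2, p3]
kept = plausible_methods(estimates, lower)                # [0, 1]
combined = combine([estimates[i] for i in kept], [0.43, 0.57])
print(kept, combined)
```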
Appendix A: Remarks



Remark 1. (Sec. 1) Since formulating the distribution combination problem in terms of a
game is unconventional, it is worth noting that game theory laid the foundations of the two
dominant schools of statistical decision theory. The maximum-expected-payoff solution of
a one-player game (von Neumann and Morgenstern, 1953, Ch. I) led to axiomatic systems
that support Bayesian statistics (e.g., Savage, 1954). Likewise, the worst-case (minimax)
solutions of certain two-player zero-sum games (von Neumann and Morgenstern, 1953, Ch.
III) led to frequentist decision theory (Wald, 1961).



Remark 2. (Sec. 2.1) The discrete-distribution version of Lemma 1, the main result of the
"redundancy-capacity theorem," was presented by R. G. Gallager in 1974 (Ryabko, 1981,
Editor's Note) and published by Ryabko (1979) and Davisson and Leon-Garcia (1980); cf.
Gallager (1979). Cover and Thomas (2006, Theorem 13.1.1), Rissanen (2007, §5.2.1), and
Csiszár and Körner (2011, Problem 8.1) provide useful introductions.



Remark 3. (Sec. 2.2) Previous instances of lexicographically maximizing expected utility
with respect to an optimal probability distribution include the use of the least informative
prior (Seidenfeld, 2004) and the use of the posterior $P_2$ used to maximize $D(P \| P_1 \leadsto P_2)$
in a two-player zero-sum game (Bickel, 2011c). On lexicographic decision making in other
contexts, see Levi (1986a, §§5.7, 6.9), Levi (1986b), and Keeney and Raiffa (1993, §3.3.1).
Ciesielski (1997, Ch. 4) and Koshy (2004, Ch. 7) provide more formal set-theoretic
expositions of lexicographic ordering.
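Lexicographic maximization of a pair of utilities can be illustrated with Python tuples, which compare component by component: the second component matters only among options tied on the first. The option labels and utility values below are hypothetical.

```python
def lexicographic_argmax(options):
    """Return the label whose utility tuple is largest in lexicographic order.

    `options` maps a label to a tuple of utilities, most important first.
    """
    return max(options, key=lambda label: options[label])


# Option C has the largest second component but loses on the first one;
# A and B tie on the first component, so the second breaks the tie.
options = {"A": (0.0, 0.10), "B": (0.0, 0.25), "C": (-0.5, 0.90)}
best = lexicographic_argmax(options)
print(best)  # B
```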



Appendix B: Additional proofs 



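Proof of Theorem 1

For any $Q \in \mathcal{P}$, eqs. ^ and ^ yield

\begin{align*}
\sup_{(P', P'') \in \dot{\mathcal{P}} \times \ddot{\mathcal{P}}} U(P', Q, P'')
  &= \sup_{(P', P'') \in \dot{\mathcal{P}} \times \ddot{\mathcal{P}}} \left( -D(P' \| P''),\; D(P' \| Q \leadsto P'') \right) \\
  &= \sup_{P' \in \dot{\mathcal{P}} \cap \ddot{\mathcal{P}}} \left( -D(P' \| P'),\; D(P' \| Q \leadsto P') \right) \\
  &= \sup_{P' \in \dot{\mathcal{P}} \cap \ddot{\mathcal{P}}} \left[ D(P' \| Q) - D(P' \| P') \right]
   = \sup_{P' \in \dot{\mathcal{P}} \cap \ddot{\mathcal{P}}} D(P' \| Q).
\end{align*}

Thus, by eqs. (§), (§), (7), and (§),

\begin{align}
\arg\sup_{Q \in \mathcal{P}} U\left( \arg\sup_{P' \in \dot{\mathcal{P}} \cap \ddot{\mathcal{P}}} D(P' \| Q),\; \arg\sup_{P' \in \dot{\mathcal{P}} \cap \ddot{\mathcal{P}}} D(P' \| Q),\; Q \right)
  &= \arg\sup_{Q \in \mathcal{P}} \left( -D\left( \arg\sup_{P' \in \dot{\mathcal{P}} \cap \ddot{\mathcal{P}}} D(P' \| Q) \,\Big\|\, Q \right) \right) \nonumber \\
  &= \arg\inf_{Q \in \mathcal{P}} \sup_{P' \in \dot{\mathcal{P}} \cap \ddot{\mathcal{P}}} D(P' \| Q). \tag{15}
\end{align}

Hence, according to eq. (3), $P^+$ is $\overline{\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}}$, the centroid of $\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}$. Since $\dot{\mathcal{P}} \cap \ddot{\mathcal{P}} \neq \emptyset$ by
assumption, the conditions of Lemma 1 are satisfied.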
Proof of Lemma 2

As an immediate consequence of what Nakagawa and Kanaya (1988) label "Theorem (Csiszár)"
and "Theorem 1,"

$$\min_{P'' \in \mathcal{P}} \max_{P' \in \mathcal{P}^*} D(P' \| P'') = C.$$

By definition, the centroid is the solution of that minimax problem (3).
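For the simplest case of two Bernoulli distributions, the minimax problem can be sketched numerically: the maximum of the two divergences is smallest where they are equal, so the weight is found by bisection on that equality. This is an illustrative sketch under the Bernoulli assumption, not the numerical routine used in the paper; the function names and the example probabilities are hypothetical.

```python
import math


def kl_bernoulli(p, q):
    """D(Bernoulli(p) || Bernoulli(q)) in nats, with 0 * log(0) taken as 0."""
    total = 0.0
    for pi, qi in ((p, q), (1.0 - p, 1.0 - q)):
        if pi > 0.0:
            total += pi * math.log(pi / qi)
    return total


def centroid_weight(a, b, tol=1e-12):
    """Weight w on Bernoulli(a) such that the mixture with success probability
    m = w * a + (1 - w) * b is equidistant in KL from both; assumes a != b."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        w = 0.5 * (lo + hi)
        m = w * a + (1.0 - w) * b
        if kl_bernoulli(a, m) > kl_bernoulli(b, m):
            lo = w  # mixture is still farther from Bernoulli(a): raise its weight
        else:
            hi = w
    return 0.5 * (lo + hi)


w = centroid_weight(0.05, 0.60)
m = w * 0.05 + (1.0 - w) * 0.60
print(w, kl_bernoulli(0.05, m), kl_bernoulli(0.60, m))
```

At the returned weight the two divergences agree to within numerical tolerance, and their common value is the minimized worst-case loss for this two-point set.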



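Proof of Theorem 2

Lemma implies that $\overline{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}$, the centroid of $\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})$, is $P^{W^{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}}$. Since,
according to the definition of an extreme point and the definition of a centroid (3), it is
not possible that there exist a $P' \in \operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})$ and a $P'' \in \operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})$ such that
$D\left(P' \,\|\, \overline{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}\right) < D\left(P'' \,\|\, \overline{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}\right)$, it follows that

$$D\left(P^* \,\|\, P^{W^{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}}\right) = \max_{P' \in \operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})} D\left(P' \,\|\, P^{W^{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}}\right)$$

for all $P^* \in \operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})$. According to Lemma, $P^{W^{\operatorname{ext}(\dot{\mathcal{P}} \cap \ddot{\mathcal{P}})}}$ is $\overline{\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}}$, the centroid of $\dot{\mathcal{P}} \cap \ddot{\mathcal{P}}$.
That centroid is $P^+$ by Theorem 1.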
Figure 1: Optimal weight versus $\underline{P}(\{0\})$ and $\overline{P}(\{0\})$, the lowest and highest of the
plausible probabilities to be combined (10), labeled here as "min. probability" and "max.
probability," respectively. The combination probability is $P^+(0) = w^+ \underline{P}(0) + (1 - w^+)\,\overline{P}(0)$
according to Corollary 1.
[Axis label: probability to combine with 0.05]

Figure 2: Three equal-weight averages of probabilities and the game-theoretic combination
of two probabilities based on Corollary 1. The two probabilities that are combined are drawn
in solid gray.
[Horizontal axis: lowest plausible probability]

Figure 3: Combination of estimates of local false discovery rates, which are empirical Bayes
posterior probabilities that the null hypotheses of equivalent gene expression are true.